Distributed Representations of Sentences and Documents
Many machine learning algorithms require the input to be represented as a
fixed-length feature vector. When it comes to texts, one of the most common
fixed-length features is bag-of-words. Despite their popularity, bag-of-words
features have two major weaknesses: they lose the ordering of the words and
they also ignore semantics of the words. For example, "powerful," "strong" and
"Paris" are equally distant. In this paper, we propose Paragraph Vector, an
unsupervised algorithm that learns fixed-length feature representations from
variable-length pieces of texts, such as sentences, paragraphs, and documents.
Our algorithm represents each document by a dense vector which is trained to
predict words in the document. Its construction gives our algorithm the
potential to overcome the weaknesses of bag-of-words models. Empirical results
show that Paragraph Vectors outperform bag-of-words models as well as other
techniques for text representations. Finally, we achieve new state-of-the-art
results on several text classification and sentiment analysis tasks.
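The weakness the abstract names can be made concrete: under a bag-of-words (one-hot) encoding, every pair of distinct words is equally far apart. A minimal pure-Python sketch (toy three-word vocabulary, illustrative only):

```python
import math

def one_hot(word, vocab):
    # Bag-of-words reduces each word to its own independent dimension.
    return [1.0 if w == word else 0.0 for w in vocab]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

vocab = ["powerful", "strong", "Paris"]
p, s, c = (one_hot(w, vocab) for w in vocab)

# All three pairwise distances are identical (sqrt(2)): the encoding
# carries no notion that "powerful" and "strong" are semantically close.
print(euclidean(p, s), euclidean(p, c), euclidean(s, c))
```

Dense representations such as Paragraph Vectors are trained so that distances like these reflect meaning instead of being uniform.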
Legal Judgement Prediction for UK Courts
Legal Judgement Prediction (LJP) is the task of automatically predicting the outcome of a court case given only the case document. Over the last five years, researchers have successfully attempted this task for the supreme courts of three jurisdictions: the European Union, France, and China. Motivations include many real-world applications, such as a prediction system usable at the judgement drafting stage and the identification of the most important words and phrases within a judgement. The aim of our research was to build, for the first time, an LJP model for UK court cases. This required the creation of a labelled data set of UK court judgements and the subsequent application of machine learning models. We evaluated different feature representations and different algorithms. Our best performing model achieved 69.05% accuracy and an F1 score of 69.02. We demonstrate that LJP is a promising area of further research for UK courts, achieving high model performance and the ability to easily extract useful features.
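The abstract reports both accuracy and F1; for readers unfamiliar with the latter, a minimal sketch of how both are computed from a binary confusion matrix. The counts below are hypothetical, chosen only to land near the paper's reported range, and are not the paper's actual confusion matrix:

```python
def f1_score(tp, fp, fn):
    # F1 is the harmonic mean of precision and recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts for a binary case-outcome label.
tp, fp, fn, tn = 70, 30, 32, 68
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"accuracy={accuracy:.2%}, F1={f1_score(tp, fp, fn):.4f}")
```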
TransNets: Learning to Transform for Recommendation
Recently, deep learning methods have been shown to improve the performance of
recommender systems over traditional methods, especially when review text is
available. For example, a recent model, DeepCoNN, uses neural nets to learn one
latent representation for the text of all reviews written by a target user, and
a second latent representation for the text of all reviews for a target item,
and then combines these latent representations to obtain state-of-the-art
performance on recommendation tasks. We show that (unsurprisingly) much of the
predictive value of review text comes from reviews of the target user for the
target item. We then introduce a way in which this information can be used in
recommendation, even when the target user's review for the target item is not
available. Our model, called TransNets, extends the DeepCoNN model by
introducing an additional latent layer representing the target user-target item
pair. We then regularize this layer, at training time, to be similar to another
latent representation of the target user's review of the target item. We show
that TransNets and extensions of it improve substantially over the previous
state-of-the-art.
Comment: Accepted for publication in the 11th ACM Conference on Recommender Systems (RecSys 2017).
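The core idea, regularizing the transformed user-item layer toward the representation of the target user's actual review, can be sketched as a squared-L2 penalty between two latent vectors. The vectors below are hypothetical stand-ins; in the paper both are produced by neural networks over review text:

```python
def sq_l2(u, v):
    # Squared L2 distance: a training-time penalty that pulls the
    # transformed user-item representation toward the representation
    # of the target user's review of the target item.
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical 3-d latent vectors (illustrative values only):
z_transform = [0.20, -0.10, 0.40]  # from the user's and item's other reviews
z_review    = [0.25, -0.05, 0.35]  # from the target user's review of the item

reg_loss = sq_l2(z_transform, z_review)
print(reg_loss)
```

At test time the target review (and hence `z_review`) is unavailable, which is exactly why the transform layer is trained to approximate it.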
Improving Negative Sampling for Word Representation using Self-embedded Features
Although the word-popularity based negative sampler has shown superb
performance in the skip-gram model, the theoretical motivation behind
oversampling popular (non-observed) words as negative samples is still not well
understood. In this paper, we start from an investigation of the gradient
vanishing issue in the skip-gram model without a proper negative sampler. By
performing an insightful analysis from the stochastic gradient descent (SGD)
learning perspective, we demonstrate that, both theoretically and intuitively,
negative samples with larger inner product scores are more informative than
those with lower scores for the SGD learner in terms of both convergence rate
and accuracy. Understanding this, we propose an alternative sampling algorithm
that dynamically selects informative negative samples during each SGD update.
More importantly, the proposed sampler accounts for multi-dimensional
self-embedded features during the sampling process, which essentially makes it
more effective than the original popularity-based (one-dimensional) sampler.
Empirical experiments further verify our observations, and show that our
fine-grained samplers gain significant improvement over the existing ones
without increasing computational complexity.
Comment: Accepted in WSDM 201
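The selection rule the abstract describes, preferring negatives with large inner-product scores against the target word, can be sketched in a few lines. The embeddings below are toy illustrative values, not learned ones:

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def informative_negatives(target_vec, candidates, k):
    # Rank candidate (non-observed) words by inner product with the
    # target word's embedding; higher scores yield larger SGD gradients,
    # so those candidates are the more informative negative samples.
    ranked = sorted(candidates,
                    key=lambda w: inner(target_vec, candidates[w]),
                    reverse=True)
    return ranked[:k]

target = [1.0, 0.0]
candidates = {"a": [0.9, 0.1], "b": [-0.5, 0.3], "c": [0.4, 0.4]}
print(informative_negatives(target, candidates, 2))  # ['a', 'c']
```

Because the ranking uses the full embedding vectors, the sampler is "multi-dimensional" in the sense the abstract contrasts with one-dimensional popularity counts.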
Code Vectors: Understanding Programs Through Embedded Abstracted Symbolic Traces
With the rise of machine learning, there is a great deal of interest in
treating programs as data to be fed to learning algorithms. However, programs
do not start off in a form that is immediately amenable to most off-the-shelf
learning techniques. Instead, it is necessary to transform the program to a
suitable representation before a learning technique can be applied.
In this paper, we use abstractions of traces obtained from symbolic execution
of a program as a representation for learning word embeddings. We trained a
variety of word embeddings under hundreds of parameterizations, and evaluated
each learned embedding on a suite of different tasks. In our evaluation, we
obtain 93% top-1 accuracy on a benchmark consisting of over 19,000 API-usage
analogies extracted from the Linux kernel. In addition, we show that embeddings
learned from (mainly) semantic abstractions provide nearly triple the accuracy
of those learned from (mainly) syntactic abstractions.
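The analogy benchmark ("a is to b as c is to ?") is conventionally solved by vector arithmetic over the embeddings. A pure-Python sketch with hypothetical 2-d vectors and made-up API names (real embeddings are learned from abstracted symbolic traces and are much higher-dimensional):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(t * t for t in x))
    return dot / (norm(u) * norm(v))

def solve_analogy(emb, a, b, c):
    # "a is to b as c is to ?" answered by the embedding nearest
    # (by cosine) to the offset vector b - a + c, excluding a, b, c.
    query = [eb - ea + ec for ea, eb, ec in zip(emb[a], emb[b], emb[c])]
    return max((w for w in emb if w not in (a, b, c)),
               key=lambda w: cosine(query, emb[w]))

# Hypothetical toy embeddings for kernel-style API names:
emb = {
    "mutex_lock":   [1.0, 0.0],
    "mutex_unlock": [0.0, 1.0],
    "file_open":    [1.0, 0.1],
    "file_close":   [0.0, 1.1],
    "kmalloc":      [0.5, 0.5],
}
print(solve_analogy(emb, "mutex_lock", "mutex_unlock", "file_open"))
```

Top-1 accuracy on the benchmark is then the fraction of analogies for which this argmax returns the expected fourth word.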
Jointly Learning Word Embeddings and Latent Topics
Word embedding models such as Skip-gram learn a vector-space representation for each word, based on the local word collocation patterns that are observed in a text corpus. Latent topic models, on the other hand, take a more global view, looking at the word distributions across the corpus to assign a topic to each word occurrence. These two paradigms are complementary in how they represent the meaning of word occurrences. While some previous works have already looked at using word embeddings for improving the quality of latent topics, and conversely, at using latent topics for improving word embeddings, such "two-step" methods cannot capture the mutual interaction between the two paradigms. In this paper, we propose STE, a framework which can learn word embeddings and latent topics in a unified manner. STE naturally obtains topic-specific word embeddings, and thus addresses the issue of polysemy. At the same time, it also learns the term distributions of the topics, and the topic distributions of the documents. Our experimental results demonstrate that the STE model can indeed generate useful topic-specific word embeddings and coherent latent topics in an effective and efficient way.
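How topic-specific embeddings address polysemy can be illustrated with a lookup table keyed by (word, topic) pairs. The vectors and topic names below are hypothetical toy values, not output of the STE model:

```python
import math

# One embedding per (word, topic) pair lets the same surface form carry
# different meanings under different topics.
topic_embeddings = {
    ("bank", "finance"):    [0.9, 0.1],
    ("bank", "geography"):  [0.1, 0.9],
    ("money", "finance"):   [0.8, 0.2],
    ("river", "geography"): [0.2, 0.8],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(t * t for t in x))
    return dot / (norm(u) * norm(v))

fin_bank = topic_embeddings[("bank", "finance")]
geo_bank = topic_embeddings[("bank", "geography")]
money = topic_embeddings[("money", "finance")]

# "bank" under the finance topic sits near "money";
# under the geography topic it does not.
print(cosine(fin_bank, money) > cosine(geo_bank, money))  # True
```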
Deep Learning Based Multi-Label Text Classification of UNGA Resolutions
The main goal of this research is to produce useful software for the United Nations (UN) that could help to speed up the process of classifying UN documents according to the Sustainable Development Goals (SDGs), in order to monitor progress at the world level in the fight against poverty, discrimination, and climate change. Human labeling of UN documents would be a daunting task given the size of the corpus involved. Thus, automatic labeling must be adopted, at least as the first step of a multi-phase process, to reduce the overall effort of cataloguing and classifying. Deep Learning (DL) is nowadays one of the most powerful tools for state-of-the-art (SOTA) AI on this task, but it often comes at the cost of an expensive and error-prone preparation of a training set. For multi-label text classification of domain-specific text, it seems that DL cannot be adopted effectively without a sufficiently large domain-specific training set. In this paper, we show that this is not always true. We propose a novel method that is able, through statistics such as TF-IDF, to exploit pre-trained SOTA DL models (such as the Universal Sentence Encoder) without any need for traditional transfer learning or any other expensive training procedure. We show the effectiveness of our method in a legal context by classifying UN Resolutions according to their most related SDGs.
Comment: 10 pages, 10 figures, accepted paper at ICEGOV 202
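The training-free matching idea can be sketched as nearest-label retrieval by cosine similarity. The sketch below uses raw term counts and made-up SDG keyword profiles purely as stand-ins; the paper's method works with TF-IDF statistics and Universal Sentence Encoder embeddings:

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = lambda c: math.sqrt(sum(n * n for n in c.values()))
    return dot / (norm(u) * norm(v))

# Hypothetical keyword profiles standing in for label embeddings.
sdg_profiles = {
    "SDG 1 (No Poverty)":      Counter("poverty income social protection".split()),
    "SDG 13 (Climate Action)": Counter("climate emissions warming resilience".split()),
}

resolution = Counter(
    "urges member states to reduce emissions and build climate resilience".split()
)

# Assign the resolution to its most similar SDG profile; no model
# training is involved, only similarity scoring.
best = max(sdg_profiles, key=lambda s: cosine(resolution, sdg_profiles[s]))
print(best)
```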